Conversation

@guitargeek (Contributor) commented Sep 1, 2025

`RecurseCountsThreadLocal` makes the assumption that only two instances are ever created per process. Its implementation can therefore be optimized to the point that it should be as fast as or faster than `RecurseCountsTBBUnique`, removing Core's dependence on TBB without compromising performance.
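
For illustration, here is a minimal sketch of the idea (hypothetical names and layout, not the actual ROOT implementation): because at most two instances exist per process, each instance can claim a fixed slot in a plain `thread_local` array, so the per-thread lookup needs no `tbb::enumerable_thread_specific`.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch, not the ROOT code: per-thread counters for up to
// two lock instances, stored in a fixed thread_local array.
struct LocalCounts {
   std::size_t fReadersCount = 0;
   bool fIsWriter = false;
};

class RecurseCountsThreadLocalSketch {
   const unsigned fSlot; // 0 or 1, assigned once per instance

   static unsigned NextSlot()
   {
      static std::atomic<unsigned> counter{0};
      return counter.fetch_add(1); // assumed to be reached at most twice per process
   }

   static LocalCounts &Local(unsigned slot)
   {
      // One pair of counters per thread and per instance; the lookup is a
      // TLS access plus an array index, with no TBB machinery involved.
      thread_local LocalCounts gCounts[2];
      return gCounts[slot];
   }

public:
   RecurseCountsThreadLocalSketch() : fSlot(NextSlot()) {}

   void IncrementReadCount() { ++Local(fSlot).fReadersCount; }
   void DecrementReadCount() { --Local(fSlot).fReadersCount; }
   bool IsCurrentWriter() { return Local(fSlot).fIsWriter; }
};
```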

@guitargeek self-assigned this Sep 1, 2025
@guitargeek requested a review from dpiparo as a code owner September 1, 2025 23:43
github-actions bot commented Sep 2, 2025

Test Results

16 files, 16 suites, 2d 13h 47m 55s ⏱️
 3 618 tests:  3 617 ✅, 0 💤, 1 ❌
57 538 runs:  57 537 ✅, 0 💤, 1 ❌

For more details on these failures, see this check.

Results for commit d7797bc.

♻️ This comment has been updated with latest results.

@guitargeek requested a review from pcanal as a code owner September 2, 2025 11:58
@guitargeek changed the title from "[ci] Check what happens when building with tbb=ON" to "[core] Implement RecurseCountsThreadLocal to be used in gCoreMutex" Sep 2, 2025
@bendavid (Contributor) commented Sep 2, 2025

I'm confused, doesn't RecurseCountsThreadLocal still depend on tbb::enumerable_thread_specific in this case?

As described in root-project#19798, we have an old build option that was supposedly removed, but it is still considered in the implementation of `ROOT::gCoreMutex`.

It would be good to know what happens when testing ROOT with it (no differences are expected, since the performance optimization with TBB was actually always enabled when building with `builtin_tbb=ON` as well, as we do for macOS and Windows).
@guitargeek (Contributor, Author)
Oh, pardon, I forgot to remove the `tbb::enumerable_thread_specific<LocalCounts> fLocalCounts` data member! It's updated now.

@bendavid (Contributor) commented Sep 2, 2025

So there's a possibly longstanding issue here, related to bd1894b and discussed in #6919 (comment).

Given that this is similar to the old implementation in:

```cpp
struct UniqueLockRecurseCount {
   using Hint_t = TVirtualRWMutex::Hint_t;

   struct LocalCounts {
      size_t fReadersCount = 0;
      bool fIsWriter = false;
   };
   size_t fWriteRecurse = 0; ///<! Number of re-entry in the lock by the same thread.

   UniqueLockRecurseCount();

   using local_t = LocalCounts *;

   local_t GetLocal()
   {
      TTHREAD_TLS_DECL(LocalCounts, gLocal);
      return &gLocal;
   }

   Hint_t *IncrementReadCount(local_t &local)
   {
      ++(local->fReadersCount);
      return reinterpret_cast<TVirtualRWMutex::Hint_t *>(&(local->fReadersCount));
   }

   template <typename MutexT>
   Hint_t *IncrementReadCount(local_t &local, MutexT &)
   {
      return IncrementReadCount(local);
   }

   Hint_t *DecrementReadCount(local_t &local)
   {
      --(local->fReadersCount);
      return reinterpret_cast<TVirtualRWMutex::Hint_t *>(&(local->fReadersCount));
   }

   template <typename MutexT>
   Hint_t *DecrementReadCount(local_t &local, MutexT &)
   {
      return DecrementReadCount(local);
   }

   void ResetReadCount(local_t &local, int newvalue) { local->fReadersCount = newvalue; }

   bool IsCurrentWriter(local_t &local) { return local->fIsWriter; }
   bool IsNotCurrentWriter(local_t &local) { return !local->fIsWriter; }

   void SetIsWriter(local_t &local)
   {
      // if (fWriteRecurse == std::numeric_limits<decltype(fWriteRecurse)>::max()) {
      //    ::Fatal("TRWSpinLock::WriteLock", "Too many recursions in TRWSpinLock!");
      // }
      ++fWriteRecurse;
      local->fIsWriter = true;
   }

   void DecrementWriteCount() { --fWriteRecurse; }
   void ResetIsWriter(local_t &local) { local->fIsWriter = false; }

   size_t &GetLocalReadersCount(local_t &local) { return local->fReadersCount; }
};
```

@guitargeek (Contributor, Author)
Oh! I see. Interesting.

@guitargeek (Contributor, Author)
I wonder how TBB avoids this problem that Philippe described...

@guitargeek marked this pull request as draft September 2, 2025 12:52
@pcanal (Member) commented Sep 2, 2025

> I wonder how TBB avoids this problem that Philippe described...

The difference between `UniqueLockRecurseCount` and `RecurseCounts` (and likely the TBB implementation) was switching from using `thread_local` (which, if I recall correctly, requires/required, via a call to `tls_get_addr_tail`, indirectly taking the same lock as ... something else, maybe related to library loading or more likely static initialization) to only using `std::this_thread::get_id()`.
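
As a sketch of the two lookup strategies being contrasted here (neither is the actual ROOT or TBB code, and the map-based variant assumes the caller provides the synchronization that the real lock already has):

```cpp
#include <cstddef>
#include <map>
#include <thread>

// Old-style lookup: a function-local thread_local. Its first use on a
// given thread can go through the dynamic-TLS machinery (__tls_get_addr
// and friends), which for dynamically loaded libraries may touch a
// runtime-loader lock.
std::size_t &CounterViaThreadLocal()
{
   thread_local std::size_t gCount = 0;
   return gCount;
}

// RecurseCounts-style lookup: key the counters by std::this_thread::get_id()
// instead, so no dynamic TLS is involved. This sketch assumes the map is
// only touched while the surrounding lock is held.
std::map<std::thread::id, std::size_t> gCountsById;

std::size_t &CounterViaThreadId()
{
   return gCountsById[std::this_thread::get_id()];
}
```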

The real challenge with this area is that seeing it work in most (all?) cases does not tell us we don't have the problem, as I was unable at the time to deterministically reproduce the issue (see the log in bd1894b).

At the very least, one should revert bd1894b (i.e. go back to using the old RecurseCounts) to check whether we see the problem. If we see the problem, then we have some degree of confidence that the alternative code not failing means it might not have the same problem. If we do not see the problem, then ... we cannot tell :(

@pcanal (Member) commented Sep 2, 2025

In tlslock.tar.gz I captured the spirit of what I remember:

```
. go
Starting execution
Start thread 1 workload
Start thread 2 workload
Thread 2 sleeps 2 seconds while holding the lock
Inner most counter for thread 0x16f7eb000 is 1
Thread 1 will request the lock
Thread 2 done sleeping and will initialize a thread_local
# Process is stuck forever.
```

on macOS and Linux. The process is actually stuck in the ROOT lock ...

Updating the example by commenting out `ROOT::EnableThreadSafety();` in main.cxx shows a slightly different behavior, but then again it is not using gSystem in a thread-safe way.

Thread 2 is waiting for the lock:

```
#0  0x00007ffff3e873f0 in __lll_lock_wait () from /lib64/libc.so.6
#1  0x00007ffff3e8d582 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x0000000000404505 in __gthread_mutex_lock (__mutex=0x4081c0 <gMainMutex>) at /cvmfs/cms.cern.ch/el9_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/x86_64-redhat-linux-gnu/bits/gthr-default.h:749
#3  0x0000000000404634 in std::mutex::lock (this=0x4081c0 <gMainMutex>) at /cvmfs/cms.cern.ch/el9_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/bits/std_mutex.h:100
#4  0x00007ffff45374ad in innermost () at inner.cxx:21
```

while the other thread is waiting on the system lock:

```
(gdb) thread 3
[Switching to thread 3 (LWP 4021866)]
#0  0x00007ffff3e873f0 in __lll_lock_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff3e873f0 in __lll_lock_wait () from /lib64/libc.so.6
#1  0x00007ffff3e8d5dd in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x00007ffff7fd3393 in _dl_open (file=0x7fffcc000c40 "/var/tmp/tlslock/libWithThreadLocal.so", mode=-2147483391,
    caller_dlopen=0x7fffe4cbf2be <cling::utils::platform::DLOpen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+30>,
    nsid=-2, argc=1, argv=0x7fffffffbc18, env=0x7fffec0226f0) at dl-open.c:819
```

It is not clear whether the example is too convoluted.

@pcanal (Member) commented Sep 3, 2025

This second example also shows the essence of the problem without ROOT: tlslock2.tar.gz

Side note: I am guessing the example can be simplified further (by removing one of the layers of libraries).
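
The pattern in those tarballs boils down to a lock-order inversion. A minimal sketch of that inversion, with the dynamic loader's internal lock modeled by a plain `std::mutex` (the real case involves `dlopen` and lazy `thread_local` initialization instead, so this only shows the shape of the deadlock, and the program hangs by construction):

```cpp
#include <chrono>
#include <mutex>
#include <thread>

std::mutex gAppLock;    // stands in for ROOT's gCoreMutex
std::mutex gLoaderLock; // stands in for the dynamic loader's internal lock

int main()
{
   // Thread 1: takes the application lock, then first-touches a
   // thread_local from a shared library, which needs the loader lock.
   std::thread t1([] {
      std::lock_guard<std::mutex> app(gAppLock);
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::lock_guard<std::mutex> loader(gLoaderLock); // blocks: t2 holds it
   });

   // Thread 2: inside dlopen (holding the loader lock), runs code such as
   // a static initializer that needs the application lock.
   std::thread t2([] {
      std::lock_guard<std::mutex> loader(gLoaderLock);
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::lock_guard<std::mutex> app(gAppLock); // blocks: t1 holds it
   });

   t1.join();
   t2.join(); // never reached: a classic ABBA deadlock
}
```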

@bendavid (Contributor) commented Sep 3, 2025

One thing I'm not sure I understood here: would it be sufficient to ensure that the thread_local variables are initialized during the loading of libThread (and/or at the subsequent creation of new threads), as opposed to potentially later, at the first call to GetLocal()?

@pcanal (Member) commented Sep 3, 2025

> Would it be sufficient to ensure that the thread_local variables are initialized during the loading of libThread

It cannot be done, can it? The thread_local variables officially need to be initialized for each (new) thread.

@bendavid (Contributor) commented Sep 3, 2025

Yes, but in this case they could/should be initialized for all existing threads at library load time, and for any additional threads immediately when each thread is created.

(As opposed to when execution reaches some particular local scope, I think, as in https://en.cppreference.com/w/cpp/language/storage_duration.html#Static_block_variables)

@pcanal (Member) commented Sep 3, 2025

> Yes, but in this case it could/should be initialized for all existing threads

I can see how that would work for a static variable ... but I am confused about how it could work for a thread_local, since there is an infinite (indeterminate) number of (upcoming) threads.

@bendavid (Contributor) commented Sep 3, 2025

Right, for threads created after the library is loaded, I would assume there are some cases where the thread_local variables would be initialized immediately upon the creation of the thread (but not necessarily for the thread_local block-variable case I linked above).

@pcanal (Member) commented Sep 3, 2025

> Right, for threads created after the library is loaded I would assume there are some cases where the thread_local variables would be initialized immediately upon the creation of the thread. (but not necessarily for the thread_local block variable case I linked above)

Humm ... intriguing ... We might indeed be able to use a `thread_local` declared at global scope rather than inside a function in order to avoid the issue.

@bendavid (Contributor) commented Sep 3, 2025

Yes, or possibly as a static thread_local class data member.
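
A sketch of the alternatives discussed in the last few comments (simplified names; whether the lazy dynamic-TLS path is fully avoided for the namespace-scope and static-member variants is implementation- and ABI-dependent, so this shows the idea rather than a guarantee):

```cpp
#include <cstddef>

struct LocalCounts {
   std::size_t fReadersCount = 0;
   bool fIsWriter = false;
};

// Block-scope variant (as in the old GetLocal()): initialized the first
// time control passes through the declaration, i.e. potentially while some
// unrelated lock is already held, which is the problematic case above.
LocalCounts *GetLocalBlockScope()
{
   thread_local LocalCounts gLocal;
   return &gLocal;
}

// Namespace-scope variant: with trivial/constant initialization, the
// implementation can set the storage up when the thread or library comes
// to life rather than at first use.
thread_local LocalCounts gLocalCounts;

// Static thread_local class data member, as suggested in the last comment.
struct RecurseCountsSketch {
   static thread_local LocalCounts fgLocal;
};
thread_local LocalCounts RecurseCountsSketch::fgLocal;
```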
